Design and Experience: Using the Intel® Itanium® 2 Processor Performance Monitoring Unit to Implement Feedback Optimizations
Abstract
Historically, profile-guided optimization has gathered its profile data by executing an instrumented binary and capturing the output. While this approach enables the collection of function and basic block frequencies, it cannot extract microarchitectural event information such as cache activity, TLB activity, and branch prediction behavior. Instrumentation also requires that programs be compiled twice with different options (once for the profile run, once for the optimization run), and the profiling run takes substantially longer because of instrumentation overhead and reduced compiler optimization. To help address these issues, the Intel® Itanium® 2 processor has extensive hardware support that allows highly accurate instruction-specific information to be gathered from any binary. In this paper, we cover three broad topics: the Itanium® 2 processor performance monitoring unit (PMU); our tools and methodology for gathering and processing cache, TLB, and branch activity information; and a case study in which we use the entire system to reduce data access stalls.
Similar resources
The Itanium 2 Processor Extends the Processing Power of the Itanium Processor Family with a Capable and Balanced Microarchitecture. Executing up to Six Instructions at a Time, It Provides Both Performance and Binary Compatibility
On 8 July 2002, Intel introduced the Itanium 2 processor—the Itanium architecture’s second implementation. This event was a milestone in the cooperation between Intel and Hewlett-Packard to establish the Itanium architecture as a key workstation, server, and supercomputer building block. The Itanium 2 processor may appear similar to the Itanium processor, yet it represents significant advances ...
A 32nm 3.1 billion transistor 12-wide-issue Itanium® processor for mission-critical servers
July 2011 Revision 1.1 Poulson High Level Summary The next generation in the Intel® Itanium® processor family, code named Poulson, has eight multi-threaded 64 bit cores. Poulson is socket compatible with the current Intel® Itanium® Processor 9300 series (Tukwila) [1]. The new design integrates a ring based system interface derived from portions of previous Xeon® and Itanium® processors, and inc...
Practical Compiler Techniques on Efficient Multithreaded Code Generation for OpenMP Programs
State-of-the-art multiprocessor systems pose several difficulties: (i) the user has to parallelize the existing serial code; (ii) explicitly threaded programs using a thread library are not portable; (iii) writing efficient multi-threaded programs requires intimate knowledge of machine’s architecture and micro-architecture. Thus, well-tuned parallelizing compilers are in high demand to leverage...
Compilation for the Itanium Processor
This paper describes a just-in-time (JIT) Java compiler for the Intel Itanium processor. The Itanium processor is an example of an Explicitly Parallel Instruction Computing (EPIC) architecture and thus relies on aggressive and expensive compiler optimizations for performance. Static compilers for Itanium use aggressive global scheduling algorithms to extract instruction-level parallelism. In a...
On the Predictability of Program Behavior Using Different Input Data Sets
Smaller input data sets such as the test and the train input sets are commonly used in simulation to estimate the impact of architecture/micro-architecture features on the performance of SPEC benchmarks. They are also used for profile feedback compiler optimizations. In this paper, we examine the reliability of reduced input sets for performance simulation and profile feedback optimizations. We...